Conversation
python/mxnet/optimizer/optimizer.py
@@ -781,6 +784,240 @@ def update(self, index, weight, grad, state):
        ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
                    lr=lr, wd=wd, **kwargs)

@register
class SGDwFastLARS(Optimizer):
Is this different from LBSGD, defined here - https://github.com/apache/incubator-mxnet/pull/16122/files#diff-0c893416e9e93fbd94dfaa9fa6c13d67R1022?
The current LBSGD implementation is buggy; maybe this should replace it rather than create SGDwFastLARS?
Thank you for the quick review.
"LB" stands for "large batch", suggesting that the optimizer bundles several techniques. Warmup can be implemented directly with an lr_scheduler (and, contrary to what the LBSGD name suggests, LARS is not a warmup strategy but an optimizer). I think the only reason to keep such a name would be to offer a choice between the LARS and LARC optimizers (LARC will come in another PR), or other large-batch techniques.
It could make sense, since both implementations are very similar, although I'm not sure it's ergonomic to select the optimizer through an argument of Optimizer.
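To illustrate what I mean by handling warmup in the lr_scheduler rather than in the optimizer, a hand-rolled scheduler could look roughly like this (just a sketch, not code from this PR; newer MXNet schedulers may already expose warmup arguments directly, and the hyperparameter values are only examples):

```python
import mxnet as mx
from mxnet import lr_scheduler

class LinearWarmup(lr_scheduler.LRScheduler):
    """Linearly ramp the LR from 0 to base_lr over `warmup_steps` updates,
    then hand off to an optional wrapped scheduler."""
    def __init__(self, base_lr, warmup_steps, after=None):
        super(LinearWarmup, self).__init__(base_lr=base_lr)
        self.warmup_steps = warmup_steps
        self.after = after  # e.g. a PolyScheduler for the post-warmup phase

    def __call__(self, num_update):
        if num_update < self.warmup_steps:
            return self.base_lr * float(num_update) / self.warmup_steps
        return self.after(num_update) if self.after is not None else self.base_lr

# Any optimizer picks this up through the base Optimizer's lr_scheduler argument.
opt = mx.optimizer.SGD(learning_rate=0.6, momentum=0.9,
                       lr_scheduler=LinearWarmup(0.6, 500))
```

This keeps the warmup policy out of the optimizer itself, which is the separation of concerns I'm arguing for.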
Yes, I am not suggesting a parameter that switches the optimizer from LARS to LARC. Maybe we could just deprecate LBSGD (or mark it for deprecation), since it does not work.
I think deprecation is the best option. Reviewers, please thumb this comment up or down to share your opinion.
I am in favor of deprecating LBSGD, but I would still prefer a better name - "Fast" in "SGDwFastLARS" does not really make sense (fast compared to what?). Maybe just LARSSGD or something like that?
We could also just call them LARC or LARS.
As a point of reference, TF calls it LARSOptimizer:
https://www.tensorflow.org/api_docs/python/tf/contrib/opt/LARSOptimizer
I renamed it LARS, and also removed the redundant "lars" prefix from eta and eps.
Let me know if it looks fine like that.
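After the rename, usage would presumably look something like this (a sketch; the registered name 'lars' and the example hyperparameter values are assumptions on my side):

```python
import mxnet as mx
from mxnet.gluon import nn

# Assuming the class is registered under the name 'lars' and the arguments
# are now eta/eps (previously lars_eta/lars_eps); values are just examples.
opt = mx.optimizer.create('lars', learning_rate=0.6, momentum=0.9,
                          eta=0.001, eps=1e-9, wd=5e-5)

net = nn.Dense(10)
net.initialize()
trainer = mx.gluon.Trainer(net.collect_params(), opt)
```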
python/mxnet/optimizer/optimizer.py
@@ -781,6 +784,240 @@ def update(self, index, weight, grad, state):
        ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
                    lr=lr, wd=wd, **kwargs)

@register
class SGDwFastLARS(Optimizer):
    def __init__(self, momentum=0.0, lazy_update=True, lars_eta=0.001, lars_eps=0,
could you add documentation like here - https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/optimizer/optimizer.py#L727
Definitely
Is this good enough?
I'm not seeing this error on my local machine. What version of Python is on the CI?
tests/python/gpu/test_optimizer.py
    return lenet

@with_seed()
def test_lars():
How long does this test take? Should it be put in the nightly tests instead of the unit tests?
Should we add a test in python/unittest/test_optimizer.py?
It takes about 10s on my V100.
Since we are actually training a network here, we should move this to the nightly tests and have an optimizer test in python/unittest/test_optimizer.py, building a mock optimizer like we have done for the other optimizers (ref - https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_optimizer.py#L514).
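For reference, a NumPy mock in the style of the existing optimizer tests could look roughly like this (a sketch of the layer-wise rule from the LARS paper; the exact formulation in this PR may differ, and the names are illustrative):

```python
import numpy as np

def lars_update_numpy(weight, grad, mom, lr=0.1, momentum=0.9,
                      wd=0.0, eta=0.001, eps=1e-9, rescale_grad=1.0):
    """One LARS step on NumPy arrays, to compare against the MXNet op."""
    grad = grad * rescale_grad
    w_norm = np.linalg.norm(weight)
    g_norm = np.linalg.norm(grad)
    # layer-wise trust ratio; fall back to 1.0 when either norm is zero
    if w_norm > 0 and g_norm > 0:
        trust = eta * w_norm / (g_norm + wd * w_norm + eps)
    else:
        trust = 1.0
    mom[:] = momentum * mom + trust * lr * (grad + wd * weight)
    weight[:] = weight - mom
    return weight, mom
```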
So, I'm already testing the MXNet ops in https://github.com/apache/incubator-mxnet/pull/16122/files#diff-4758fb9329d438de2836db2634a8f5f7R270-R422, which are the only non-Python part of the optimizer. What my test adds is showing that the optimizer behaves properly, i.e. that it allows you to train a network with a bigger batch. A fully Python mock wouldn't help me test that (the Python could be wrong as well, say if I misunderstood the publication).
@anirudhacharya Is moving my training test to the nightly tests and just unit testing my MXNet ops in /tests/python/gpu/test_operator_gpu.py enough? If yes, I believe I'm all done for this PR :)
@Caenorst sorry, I missed seeing this. I will take another look. And yes, it would be good to move time-consuming tests to the nightly tests.
I understand that the Python mock optimizer could be wrong if our understanding of the paper is incorrect; let us try to catch that in the code review process.
Apart from the above-mentioned test issue, everything else LGTM.
I don't understand the errors in the CI, can somebody help me?
LGTM
@Caenorst Thanks a lot for your contribution!
v = v.astype('float32')
if rescale:
    v *= self.rescale_grad
norm = NDnorm(v).asnumpy()[0]
Is this intended? I thought having a blocking call here is bad.
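For context, one way to avoid the synchronization would be to keep the norm on the device with lazy NDArray ops, roughly like this sketch (a hypothetical helper, not code from this PR; the fused multi-tensor ops added here are presumably the intended fast path):

```python
import mxnet as mx

def lars_trust_ratio(weight, grad, eta=0.001, wd=0.0, eps=1e-9):
    """Layer-wise trust ratio computed entirely with lazy NDArray ops."""
    w_norm = mx.nd.norm(weight)   # stays on the device, no asnumpy() sync
    g_norm = mx.nd.norm(grad)
    return eta * w_norm / (g_norm + wd * w_norm + eps)
```

The resulting one-element NDArray could then be combined with the update via mx.nd.broadcast_mul without ever forcing a round trip to the host.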
Commits
* add MXNet operator for fast LARS
* add unit tests for fast LARS related MXNet Ops
* fix preloaded_multi_* dtype inference, add SGDwFastLARS optimizer and test (Conflicts: tests/python/gpu/test_operator_gpu.py)
* remove commented out cast from lenet5 model
* fix lint
* Add more documentation, change name of SGDwFastLARS by LARS, removing redundancy of 'lars' in the parameters
* change optimizer code to be python2 retro-compatible
* fix lint
* replace push_back by emplace_back for clang-tidy
Description
Add a fast implementation of the LARS optimizer.
Checklist
Essentials
Changes
Comments
This code has been used for the MLPerf v0.6 benchmarks. It is especially pertinent if you are training with a small local batch size, since the cost of the optimizer usually does not scale with the batch size, so its relative overhead grows as the local batch shrinks.
Credits